The USA TODAY Diversity Index is a number – on a scale from 0 to 100 – that represents the chance that two people chosen randomly from an area will be different by race and ethnicity. In more personal terms: “What is the chance that the next person I meet will be different from me?” A higher number means more diversity, a lower number, less. The index was invented in 1991 by Phil Meyer of the University of North Carolina and Shawn McIntosh of USA TODAY.
Answer: The ACS publishes data at the state, county, and tract levels. A tract is much smaller than a city, but it is the closest available level. The same levels exist in the decennial census as well.
Answer: B02001_002 through B02001_006 each estimate the population of one race: 002 is White alone, 003 is Black or African American alone, 004 is American Indian and Alaska Native alone, 005 is Asian alone, and 006 is Native Hawaiian and Other Pacific Islander alone. B02001_001 is the estimated total population. B03002_001 is the estimated total used for the Hispanic or Latino origin counts, B03002_002 is the estimated number of people who are not Hispanic or Latino, and B03002_012 is the estimated number who are Hispanic or Latino.
Answer: Race: A social definition used within the country; the standards are set by the US Office of Management and Budget (OMB). The categories usually reflect where a person's origins lie. Ethnicity: OMB treats ethnicity as a social definition that is separate from race, and it recognizes only two ethnicities: Hispanic/Latino and not Hispanic/Latino. People who identify as Hispanic/Latino can also identify as any of the races listed by OMB. This matters because the same people can fall into several categories. Because race and ethnicity are defined differently and can overlap, it is hard to compare racial groups directly with ethnic groups.
Answer: Yes, the results may differ slightly from the real world. Important decisions based on these results should therefore be made with caution, and the uncertainty should be stated transparently when the results are discussed.
library(tidycensus)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## Warning: package 'tibble' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
census_api_key("d339aa0082b0bc4697f233a571cd3cfee61f5c6d")
## To install your API key for use in future sessions, run this function with `install = TRUE`.
#Create vector with all the variables
var_race <- c(paste0("B02001_00", 1:6), paste0("B03002_00", 1:2), "B03002_012")
#Create vector with all the states
states <- c("NY", "CT", "NJ")
#Use the API to get the data
df <- get_acs(geography = "county", variables = var_race, state = states, year = 2018)
## Getting data from the 2014-2018 5-year ACS
#Display the first 6 rows
head(df)
## # A tibble: 6 x 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 09001 Fairfield County, Connecticut B02001_001 944348 NA
## 2 09001 Fairfield County, Connecticut B02001_002 691642 2693
## 3 09001 Fairfield County, Connecticut B02001_003 107215 1230
## 4 09001 Fairfield County, Connecticut B02001_004 2338 505
## 5 09001 Fairfield County, Connecticut B02001_005 49927 908
## 6 09001 Fairfield County, Connecticut B02001_006 487 159
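The moe column in the output above is the margin of error that tidycensus returns next to each ACS estimate; ACS margins are published at a 90 percent confidence level by default. As a minimal sketch in base R, the helper below (moe_to_interval is a hypothetical name, not part of tidycensus) turns an estimate and its margin into an interval, using two Fairfield County rows copied from the output.

```r
# Hypothetical helper: turn an ACS estimate and its margin of error into an interval
moe_to_interval <- function(estimate, moe) {
  data.frame(
    estimate = estimate,
    lower = estimate - moe,
    upper = estimate + moe,
    # Size of the margin as a percentage of the estimate
    moe_pct = 100 * moe / estimate
  )
}

# Values copied from the head(df) output above (Fairfield County, Connecticut)
fairfield <- moe_to_interval(
  estimate = c(White = 691642, Pacific = 487),
  moe = c(White = 2693, Pacific = 159)
)
fairfield
```

For the large White alone group the margin is well under 1 percent of the estimate, while for the small Native Hawaiian and Other Pacific Islander group it is roughly a third of the estimate, so proportions for small groups carry more uncertainty.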
Each of the calculations below will be done by county and not in aggregate.
Step 1:
In the current federal scheme, there are five named races – white, black/African-American, American Indian/Alaska Native, Asian and Native Hawaiian/Other Pacific Islander and an estimate for total population (B02001_001-B02001_006). Ensure that you have collected the proper data from the tidycensus API for these values, as well as the values for the Hispanic population (B03002_001-B03002_002 & B03002_012).
Use the spread function to create columns for each racial group (and the total population). Rename these columns to better reflect the data if you have not already done so.
Calculate each group’s share of the population. This is done by dividing the value for each racial column by the total population column. Create new variables for your data frame for each calculation.
\[ \small RaceProportion_i = \frac{Race_i}{Total_i} \]
#This block of code first transforms the different variables into columns in order to make the calculations easier.
#Second, it renames the variable columns to names that are easier to understand.
#Lastly, it does the necessary calculations in order to get the RaceProportion.
#Every line ends with %>% so that R continues the pipeline on the next line.
df <- df %>%
  pivot_wider(names_from = variable, values_from = c(estimate, moe)) %>%
  select(GEOID, NAME, paste0("estimate_", var_race)) %>%
  rename(
    TotalEstimateRace = estimate_B02001_001,
    WhiteEstimate = estimate_B02001_002,
    BlackEstimate = estimate_B02001_003,
    NativeEstimate = estimate_B02001_004,
    AsianEstimate = estimate_B02001_005,
    PacificEstimate = estimate_B02001_006,
    TotalEstimateEthnic = estimate_B03002_001,
    NotHispanicEstimate = estimate_B03002_002,
    HispanicEstimate = estimate_B03002_012
  ) %>%
  mutate(
    WhiteProp = WhiteEstimate / TotalEstimateRace,
    BlackProp = BlackEstimate / TotalEstimateRace,
    NativeProp = NativeEstimate / TotalEstimateRace,
    AsianProp = AsianEstimate / TotalEstimateRace,
    PacificProp = PacificEstimate / TotalEstimateRace
  )
df
## # A tibble: 91 x 16
## GEOID NAME TotalEstimateRa~ WhiteEstimate BlackEstimate NativeEstimate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 09001 Fair~ 944348 691642 107215 2338
## 2 09003 Hart~ 894730 639528 122103 2945
## 3 09005 Litc~ 183031 170083 3528 356
## 4 09007 Midd~ 163368 144844 8654 264
## 5 09009 New ~ 859339 634586 113737 1540
## 6 09011 New ~ 268881 216974 15760 1477
## 7 09013 Toll~ 151269 133656 4621 67
## 8 09015 Wind~ 116538 103567 2644 711
## 9 34001 Atla~ 268539 178914 40319 969
## 10 34003 Berg~ 929999 663672 55497 1695
## # ... with 81 more rows, and 10 more variables: AsianEstimate <dbl>,
## # PacificEstimate <dbl>, TotalEstimateEthnic <dbl>,
## # NotHispanicEstimate <dbl>, HispanicEstimate <dbl>, WhiteProp <dbl>,
## # BlackProp <dbl>, NativeProp <dbl>, AsianProp <dbl>, PacificProp <dbl>
Step 2:
Take each racial group’s share of the population, square it and sum the results.
\[ \small P(Racial_i) = \sum_{i=1}^{n} RaceProportion_i^2 \]
The Census also includes a category called “Some other race.” Because studies show that people who check it are overwhelmingly Hispanic, that category is not used. Hispanics’ effect on diversity is calculated in Step 3.
#Creating a new column called p_racial with the proper calculations. sum() did not work for me here, so I fell back to adding the squared terms together manually. I know this is not best practice.
df <- df %>% mutate(p_racial = WhiteProp^2 + BlackProp^2 + NativeProp^2 + AsianProp^2 + PacificProp^2)
df
## # A tibble: 91 x 17
## GEOID NAME TotalEstimateRa~ WhiteEstimate BlackEstimate NativeEstimate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 09001 Fair~ 944348 691642 107215 2338
## 2 09003 Hart~ 894730 639528 122103 2945
## 3 09005 Litc~ 183031 170083 3528 356
## 4 09007 Midd~ 163368 144844 8654 264
## 5 09009 New ~ 859339 634586 113737 1540
## 6 09011 New ~ 268881 216974 15760 1477
## 7 09013 Toll~ 151269 133656 4621 67
## 8 09015 Wind~ 116538 103567 2644 711
## 9 34001 Atla~ 268539 178914 40319 969
## 10 34003 Berg~ 929999 663672 55497 1695
## # ... with 81 more rows, and 11 more variables: AsianEstimate <dbl>,
## # PacificEstimate <dbl>, TotalEstimateEthnic <dbl>,
## # NotHispanicEstimate <dbl>, HispanicEstimate <dbl>, WhiteProp <dbl>,
## # BlackProp <dbl>, NativeProp <dbl>, AsianProp <dbl>, PacificProp <dbl>,
## # p_racial <dbl>
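A likely reason sum() did not behave as expected in Step 2 is that sum() collapses everything it is given into a single number, whereas the index needs one sum per county. The sketch below uses base R and made-up proportions (not Census data) to show the difference between sum() and the row-wise rowSums().

```r
# Toy proportions for two made-up counties (not Census data)
props <- data.frame(
  WhiteProp = c(0.5, 0.9),
  BlackProp = c(0.5, 0.1),
  AsianProp = c(0.0, 0.0)
)

# sum() adds every value in every column and returns one number
collapsed <- sum(props^2)

# rowSums() adds across the columns separately for each row (county)
p_racial <- rowSums(props^2)
p_racial
```

Inside mutate(), the same idea could be written as rowSums(cbind(WhiteProp, BlackProp, NativeProp, AsianProp, PacificProp)^2), which should give the same values as the manual addition.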
Step 3:
Because Hispanic origin is a separate Census question, the probability that someone is Hispanic or not must be figured separately. Take the Hispanic and non-Hispanic percentages of the population, square each and add them to get the chance that any two people will be Hispanic or not. Use this calculation to create a new variable in your data frame.
\[ \small P(Ethnic_i) = Hispanic_i^2+ Non Hispanic_i^2 \]
#Creating a new column called p_ethnic, which is the same calculation as for p_racial but with ethnicity instead of race
df <- df %>% mutate(p_ethnic = (HispanicEstimate/TotalEstimateEthnic)^2 + (NotHispanicEstimate/TotalEstimateEthnic)^2)
df
## # A tibble: 91 x 18
## GEOID NAME TotalEstimateRa~ WhiteEstimate BlackEstimate NativeEstimate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 09001 Fair~ 944348 691642 107215 2338
## 2 09003 Hart~ 894730 639528 122103 2945
## 3 09005 Litc~ 183031 170083 3528 356
## 4 09007 Midd~ 163368 144844 8654 264
## 5 09009 New ~ 859339 634586 113737 1540
## 6 09011 New ~ 268881 216974 15760 1477
## 7 09013 Toll~ 151269 133656 4621 67
## 8 09015 Wind~ 116538 103567 2644 711
## 9 34001 Atla~ 268539 178914 40319 969
## 10 34003 Berg~ 929999 663672 55497 1695
## # ... with 81 more rows, and 12 more variables: AsianEstimate <dbl>,
## # PacificEstimate <dbl>, TotalEstimateEthnic <dbl>,
## # NotHispanicEstimate <dbl>, HispanicEstimate <dbl>, WhiteProp <dbl>,
## # BlackProp <dbl>, NativeProp <dbl>, AsianProp <dbl>, PacificProp <dbl>,
## # p_racial <dbl>, p_ethnic <dbl>
Step 4:
To calculate whether two people are the same on both measures, multiply the results of the first two steps. Use this calculation to create a new column in your data frame. This is the probability that any two people are the SAME by race and ethnicity.
\[ \small P(Same_i) = P(Racial_i) \times P(Ethnic_i) \]
#Creating a new column called p_same which will be used for later calculations.
df <- df %>% mutate(p_same = p_racial * p_ethnic)
df
## # A tibble: 91 x 19
## GEOID NAME TotalEstimateRa~ WhiteEstimate BlackEstimate NativeEstimate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 09001 Fair~ 944348 691642 107215 2338
## 2 09003 Hart~ 894730 639528 122103 2945
## 3 09005 Litc~ 183031 170083 3528 356
## 4 09007 Midd~ 163368 144844 8654 264
## 5 09009 New ~ 859339 634586 113737 1540
## 6 09011 New ~ 268881 216974 15760 1477
## 7 09013 Toll~ 151269 133656 4621 67
## 8 09015 Wind~ 116538 103567 2644 711
## 9 34001 Atla~ 268539 178914 40319 969
## 10 34003 Berg~ 929999 663672 55497 1695
## # ... with 81 more rows, and 13 more variables: AsianEstimate <dbl>,
## # PacificEstimate <dbl>, TotalEstimateEthnic <dbl>,
## # NotHispanicEstimate <dbl>, HispanicEstimate <dbl>, WhiteProp <dbl>,
## # BlackProp <dbl>, NativeProp <dbl>, AsianProp <dbl>, PacificProp <dbl>,
## # p_racial <dbl>, p_ethnic <dbl>, p_same <dbl>
Step 5:
Subtract the result from 1 to get the chance that two people are different – diverse. For ease of use, multiply the result by 100 to place it on a scale from 0 to 100. Create a new column with your USA Today Diversity Index value.
\[ \small DiversityIndex_i = \Big( 1 - P(Same_i) \Big) \times 100 \]
#Create the column called DiversityIndex
df <- df %>% mutate(DiversityIndex = (1 - p_same)*100)
df
## # A tibble: 91 x 20
## GEOID NAME TotalEstimateRa~ WhiteEstimate BlackEstimate NativeEstimate
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 09001 Fair~ 944348 691642 107215 2338
## 2 09003 Hart~ 894730 639528 122103 2945
## 3 09005 Litc~ 183031 170083 3528 356
## 4 09007 Midd~ 163368 144844 8654 264
## 5 09009 New ~ 859339 634586 113737 1540
## 6 09011 New ~ 268881 216974 15760 1477
## 7 09013 Toll~ 151269 133656 4621 67
## 8 09015 Wind~ 116538 103567 2644 711
## 9 34001 Atla~ 268539 178914 40319 969
## 10 34003 Berg~ 929999 663672 55497 1695
## # ... with 81 more rows, and 14 more variables: AsianEstimate <dbl>,
## # PacificEstimate <dbl>, TotalEstimateEthnic <dbl>,
## # NotHispanicEstimate <dbl>, HispanicEstimate <dbl>, WhiteProp <dbl>,
## # BlackProp <dbl>, NativeProp <dbl>, AsianProp <dbl>, PacificProp <dbl>,
## # p_racial <dbl>, p_ethnic <dbl>, p_same <dbl>, DiversityIndex <dbl>
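Steps 1 to 5 can be checked by hand on a small case. The function below is a sketch in base R for a single area; diversity_index and its argument names are hypothetical, and the counts are made up so that the result is easy to verify mentally.

```r
# Hypothetical helper that follows Steps 1 to 5 for one area.
# race_counts:  counts for the five named race groups
# race_total:   total population from the race table
# hispanic, not_hispanic, ethnic_total: counts from the Hispanic-origin table
diversity_index <- function(race_counts, race_total,
                            hispanic, not_hispanic, ethnic_total) {
  # Steps 1 and 2: squared population shares of each race, summed
  p_racial <- sum((race_counts / race_total)^2)
  # Step 3: squared Hispanic and non-Hispanic shares, summed
  p_ethnic <- (hispanic / ethnic_total)^2 + (not_hispanic / ethnic_total)^2
  # Step 4: chance that two people match on both measures
  p_same <- p_racial * p_ethnic
  # Step 5: chance that they differ, on a 0 to 100 scale
  (1 - p_same) * 100
}

# Everyone in one race group and nobody Hispanic: no diversity
uniform_area <- diversity_index(c(100, 0, 0, 0, 0), 100, 0, 100, 100)

# Two equal race groups and an even Hispanic split: 1 - 0.5 * 0.5 = 0.75
split_area <- diversity_index(c(50, 50, 0, 0, 0), 100, 50, 50, 100)

c(uniform_area, split_area)
```

The two cases give 0 and 75, which matches the interpretation that a higher number means more diversity.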
Be sure to properly label your plots and axes. Points will be deducted for incorrect plot titles or axes.
Answer: The distribution is positively skewed. This means that the dataset contains more counties with a low diversity index than counties with a high one.
#Creating a histogram with the frequency of the different Diversity Indexes from the counties
hist(df$DiversityIndex, main = "Histogram for USA Diversity Index", xlab = "Diversity Index", ylab = "Frequency")
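The visual impression of skewness from a histogram can be backed up with a number. Base R has no built-in skewness function, so the sketch below defines a moment-based one (sample_skewness is a hypothetical name) and tries it on made-up values rather than on the county data.

```r
# Moment-based sample skewness; positive values indicate a longer right tail
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Made-up values: most are low, one is very high
right_skewed <- c(10, 12, 12, 15, 18, 20, 65)
# Evenly spread values
symmetric <- c(1, 2, 3, 4, 5)

skew_right <- sample_skewness(right_skewed)
skew_sym <- sample_skewness(symmetric)
```

Here skew_right is positive and skew_sym is zero. Calling sample_skewness(df$DiversityIndex) would give the corresponding figure for the counties.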
library(ggplot2)
#Sorting the data to get the 10 counties with the highest Diversity Index
index_order <- order(df$DiversityIndex, decreasing = TRUE)
index_order <- index_order[1:10]
#Create a bar chart with the 10 counties with the highest Diversity Index
df_index_order <- df[index_order,c("NAME", "DiversityIndex")]
ggplot(df_index_order, aes(x=NAME, y=DiversityIndex)) + geom_col() + scale_x_discrete(guide = guide_axis(n.dodge=4)) + labs(title="The 10 Counties With the Highest Diversity Index")
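With a character column on the x axis, ggplot2 orders the bars alphabetically by name, so the ranking from order() is lost in the chart. One possible fix, sketched here with made-up names and values, is to convert NAME into a factor whose levels follow the index, using reorder() from base R.

```r
# Made-up names and index values, for illustration only
top <- data.frame(
  NAME = c("County A", "County B", "County C"),
  DiversityIndex = c(55.2, 71.8, 63.4)
)

# reorder() turns NAME into a factor whose levels are sorted by DiversityIndex;
# negating the values puts the largest first
top$NAME <- reorder(top$NAME, -top$DiversityIndex)
levels(top$NAME)
```

In the chart above this would correspond to aes(x = reorder(NAME, -DiversityIndex), y = DiversityIndex), with labs(x = "County", y = "Diversity Index") added so that the axes stay properly labeled.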
library(mapview)
library(sp)
## Warning: package 'sp' was built under R version 4.0.3
#Since some of the calculations could not be done when the geometry columns existed because of variable length limitations, a new df is created with the geometry.
df1 <- get_acs(geography = "county", variables = var_race, state = states, year = 2018, geometry = TRUE)
## Getting data from the 2014-2018 5-year ACS
## Downloading feature geometry from the Census website. To cache shapefiles for use in future sessions, set `options(tigris_use_cache = TRUE)`.
#Only keeping the relevant columns
df1 <- df1[, c(1, 2, 6)]
#Sorting df1 so it aligns with df in order to get the proper calculations to the proper rows
df1 <- df1 %>% arrange(GEOID)
#Get the DiversityIndex column from df and give those values to the corresponding row in df1
for (i in 1:nrow(df)) {
for (j in 1:9) {
df1[(9 * (i - 1) + j), "DiversityIndex"] = df[i, "DiversityIndex"]
#print((9 * (i - 1) + j))
}
}
#Create the heatmap of the sampled counties
mapview(df1, zcol = "DiversityIndex", legend = TRUE)
df1
## Simple feature collection with 819 features and 3 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: -79.76215 ymin: 38.92852 xmax: -71.78699 ymax: 45.01585
## geographic CRS: NAD83
## First 10 features:
## GEOID NAME geometry
## 1 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 2 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 3 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 4 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 5 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 6 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 7 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 8 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 9 09001 Fairfield County, Connecticut MULTIPOLYGON (((-73.21717 4...
## 10 09003 Hartford County, Connecticut MULTIPOLYGON (((-73.02054 4...
## DiversityIndex
## 1 62.01604
## 2 62.01604
## 3 62.01604
## 4 62.01604
## 5 62.01604
## 6 62.01604
## 7 62.01604
## 8 62.01604
## 9 62.01604
## 10 62.23732
#I understand that there must be a more efficient way of doing this, but this was the way it worked for me.
#Since I could not find a way to use (label = "COLUMN") to take more than one argument, I decided to create a new column which combines the two desired columns in order for (label = "COLUMN") to display the desired tooltip
df1$LabelDiversity <- paste(df1$NAME, df1$DiversityIndex)
mapview(df1, zcol = "DiversityIndex", legend = TRUE, label = "LabelDiversity")
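The loop that copies DiversityIndex into df1 depends on both tables being sorted the same way and on every county having exactly nine rows. A join on GEOID avoids both assumptions. The sketch below uses plain data frames as stand-ins for df and df1 (the names index_by_county and shapes are hypothetical, and the index values are rounded from the output above), and it also rounds the index before building the tooltip label.

```r
# Stand-in for df: one row per county with the computed index
index_by_county <- data.frame(
  GEOID = c("09001", "09003"),
  DiversityIndex = c(62.0, 62.2)
)

# Stand-in for df1: several rows per county, in arbitrary order
shapes <- data.frame(
  GEOID = c("09003", "09001", "09001", "09003", "09001"),
  NAME = c("Hartford", "Fairfield", "Fairfield", "Hartford", "Fairfield")
)

# Match on GEOID, keeping every row of shapes
joined <- merge(shapes, index_by_county, by = "GEOID", all.x = TRUE)

# Round before pasting so the tooltip stays short
joined$LabelDiversity <- paste0(joined$NAME, ": ", round(joined$DiversityIndex, 1))
joined
```

For the real sf object, the dplyr equivalent would presumably be left_join(df1, select(df, GEOID, DiversityIndex), by = "GEOID") with df1 as the first argument so that the geometry column is kept; this has not been run against the data here.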
Answer: The counties in and around NYC are clearly the most diverse, while rural areas appear to be less diverse. This makes sense, since cities tend to be more diverse than rural communities.
Create a new data frame using the tidycensus API with data on median household income by county for New York, New Jersey and Connecticut. Join this data together with the data from New York County. Use ggplot2 (or another visualization library) to visualize the USA Today Diversity Index value and median household income on the same plot (Hint: try facet wrap!).
Does there appear to be any relationship between median household income and diversity? How do counties differ on these two measures?
Answer: